2022-09-16

Members

Student Performance DataSet

  • The data covers student academic grades for the math course at two Portuguese public secondary schools during the 2005/2006 school year.
  • The dataset has 395 records in total, with 5 numerical variables and 28 categorical variables.
    • The numerical variables are age, number of school absences, 1st period grade (G1), 2nd period grade (G2) and final period grade (G3, the target variable).
    • Since there are too many categorical variables, our analysis includes only school, sex and number of past failures.
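Loading this dataset could be sketched as below. The file name `student-mat.csv` and the semicolon separator follow the UCI "Student Performance" repository layout; the two inline rows are synthetic stand-ins so the sketch is self-contained, not real records.

```python
# Sketch: loading the UCI student-mat data and selecting our variables.
# The real file has 33 columns and 395 rows; this two-row sample only
# mimics the columns we keep and is NOT actual data.
import io
import pandas as pd

sample = io.StringIO(
    "school;sex;age;failures;absences;G1;G2;G3\n"
    "GP;F;18;0;6;5;6;6\n"
    "MS;M;17;0;4;5;5;6\n"
)
df = pd.read_csv(sample, sep=";")
# In practice: df = pd.read_csv("student-mat.csv", sep=";")

numeric_vars = ["age", "absences", "G1", "G2", "G3"]
categorical_vars = ["school", "sex", "failures"]
print(df[numeric_vars].shape)
```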

First View of Data

Barplot for Categorical Variables


  • There are more students from Gabriel Pereira school than from Mousinho da Silveira school.
  • There are slightly more female students than male students.
  • Most students have never failed a course before.
Barplot for Numerical Variables


  • The red dashed line marks the mean, and the purple curve is a fitted density.
  • Mean: Age 17, Number of absences 6, G1 grade 11, G2 grade 11, G3 grade 10
  • Variance: Age 1.63, Number of absences 64.05, G1 grade 11.02, G2 grade 14.15, G3 grade 20.99
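The mean and variance summaries above could be reproduced as follows. The toy values here are placeholders (the real columns come from the dataset); pandas' `var()` uses the sample variance (ddof=1), matching R's `var()`.

```python
# Sketch: per-variable mean and variance, shown on toy values.
import pandas as pd

df = pd.DataFrame({"age": [17, 18, 16], "absences": [6, 0, 12]})
means = df.mean()
variances = df.var()  # sample variance (ddof=1), as in R
print(means["age"], variances["absences"])  # 17.0 36.0
```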

EDA - Target Variable

Barplot of G3


EDA - Correlation

Correlation between five Numerical Variables


Leading Questions

  • Can PCA reduce the number of dimensions to model the students’ final grade?

  • Can we build a discriminant model to classify students as pass or fail based on the other features (e.g., first period grade, age)?

PCA

We conducted PCA on 4 numerical variables (G1, G2, Age, Number of Absences) to see whether we can perform dimensionality reduction.

                          PC1     PC2     PC3     PC4
Standard deviation     8.0088  4.8287  1.4047  1.1888
Proportion of Variance 0.7061  0.2567  0.0217  0.0156
Cumulative Proportion  0.7061  0.9627  0.9844  1.0000
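The PCA step could be sketched with scikit-learn as below. The data here are synthetic stand-ins for G1, G2, age and absences; `explained_variance_ratio_` and its cumulative sum correspond to the "Proportion of Variance" and "Cumulative Proportion" rows of the table.

```python
# Sketch: PCA on the 4 numeric predictors (synthetic placeholder data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(395, 4))      # stand-in for G1, G2, age, absences

pca = PCA(n_components=4)
scores = pca.fit_transform(X)      # PC1..PC4 score for each student

prop_var = pca.explained_variance_ratio_
cum_var = np.cumsum(prop_var)
print(np.round(prop_var, 4), np.round(cum_var, 4))
```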

Fit Linear Regression

\[ Full\ model:G3_i = \beta_0 + \beta_1G1_i + \beta_2G2_i + \beta_3Age_i + \beta_4Absence_i + \epsilon_i \]

\[ PC12\ model:G3_i = \beta_0 + \beta_1PC1_i + \beta_2PC2_i + \epsilon_i \]

\[ PC123\ model:G3_i = \beta_0 + \beta_1PC1_i + \beta_2PC2_i + \beta_3PC3_i + \epsilon_i \]

Diagnostics of Full Model

Diagnostics of PC12 Model

Diagnostics of PC123 Model

Model Comparison

Check AIC, BIC and Adjusted \(R^2\)

               AIC      BIC      Adjusted \(R^2\)
Full Model     1637.29  1661.16  0.83
PC12 Model     1689.13  1705.05  0.80
PC123 Model    1637.31  1657.20  0.83
  • In contrast with what we saw in the scree plot earlier, the PC123 model is the ‘best’ model so far.

  • We can therefore reduce the number of predictors from 4 to 3 when modelling the final grade.

LDA

  • Fail: G3 grade below \(10\). Total Fail students: \(130\).
  • Pass: G3 grade of \(10\) or above. Total Pass students: \(265\).

LDA Histogram

We split the dataset into two subsets: \(70\%\) for training and \(30\%\) for testing.
The plot shows some overlap between the two groups, although many observations clearly belong to one group or the other.
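The labelling, split and LDA fit could be sketched with scikit-learn as below. The feature set (G1, G2, age, absences) mirrors the PCA step and is an assumption; the data are synthetic placeholders.

```python
# Sketch: pass/fail labels, 70/30 split, and LDA (placeholder data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(395, 4))                            # toy features
G3 = (X[:, 0] + X[:, 1] + rng.normal(size=395) > 0) * 12  # toy final grade
label = np.where(G3 >= 10, "Pass", "Fail")               # pass: G3 >= 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X, label, test_size=0.30, random_state=0, stratify=label
)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
pred = lda.predict(X_te)
print(lda.score(X_te, y_te))  # test-set accuracy
```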

LDA Partition Plot

LDA Confusion Matrix

Training data

                 Actual Fail  Actual Pass
Predicted Fail            80           15
Predicted Pass            18          175
  • Accuracy: 0.89, Precision: 0.84, Recall: 0.82 (Fail as the positive class).

Test data

                 Actual Fail  Actual Pass
Predicted Fail            27            4
Predicted Pass             5           71
  • Accuracy: 0.92, Precision: 0.87, Recall: 0.84 (Fail as the positive class).
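The reported metrics follow directly from the confusion-matrix counts, treating Fail as the positive class. Using the training-data counts above:

```python
# Accuracy, precision and recall from the training confusion matrix,
# with "Fail" as the positive class.
tp, fp = 80, 15    # predicted Fail: actually Fail / actually Pass
fn, tn = 18, 175   # predicted Pass: actually Fail / actually Pass

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 255 / 288
precision = tp / (tp + fp)                   # 80 / 95
recall = tp / (tp + fn)                      # 80 / 98
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
# → 0.89 0.84 0.82
```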

Problems

  • We also tried a Naive Bayes model to see if it performs better, but its results were not as good as LDA's, so we do not show them here.

  • The original dataset contains many other variables. Relationships among these variables may affect G3 and could lead to quite different results, or explain anomalies present in the data. Since we did not include these variables, we do not know what impact they may have on our analysis.

  • The dataset covers only the 2005/2006 school year, and it is now 2022. A dataset from 2021/2022, or at least from the last 5 years, would be of more benefit to the current generation of secondary school students, as the curriculum has changed, and will continue to change, over time.

Conclusion

  • In response to the leading questions, the analysis was very satisfactory: we were able to apply the tools of PCA and discriminant analysis to our dataset as intended.

  • Using PCA, we successfully reduced the number of variables from 4 to 3, with the PC123 model performing best.

  • The LDA also gave good results, especially with the G1 and G2 variables (first and second period grades), producing few misclassifications, as shown clearly in the plots and tables. This aligns with the common-sense expectation that performance in earlier periods best classifies final (G3) performance.

  • As mentioned in the problems above, on closer inspection of how the data were gathered, it is unfortunately hard to see the relevance of this data in 2022, in NZ, after COVID and the shift to online learning. Nonetheless, our analysis of this data proved to be successful.

Thank you!